Data integration with STACAS
Introduction
STACAS is a method for scRNA-seq data integration. It is based on the Seurat integration framework, and adds the following innovations:
- anchors are filtered/down-weighted based on their distance in reciprocal PCA space, calculated from the unscaled, normalized data
- integration trees are constructed based on the ‘centrality’ of datasets, as measured by the sum of their anchor (distance-weighted) scores
- cell labels, if known, can be given as input to the algorithm to perform semi-supervised integration
In this demo we will show the application of STACAS to integrate a collection of scRNA-seq datasets of immune cells from multiple donors, human tissues and studies, assembled by Luecken et al. for their excellent benchmark.
The data are available at: figshare/12420968
R environment
Get and load some useful packages
Load test datasets
Download the dataset of human immune cells assembled by Luecken et al., and convert them to Seurat objects.
download <- F
where <- 'aux'
dir.create(where, showWarnings = FALSE)
rds.path <- sprintf("%s/Immune_ALL_human.rds", where)
if(download){
options(timeout=500)
url <- "https://figshare.com/ndownloader/files/25717328"
h5.path <- sprintf("%s/Immune_ALL_human.rds", where)
download.file(url = url, destfile = h5.path)
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!require("zellkonverter", quietly = TRUE))
install.packages("zellkonverter") # to convert from h5ad to R object
#Convert to Seurat
object.sce <- zellkonverter::readH5AD(h5.path)
object <- Seurat::as.Seurat(object.sce, counts = "counts", data = "X")
object <- RenameAssays(object, originalexp="RNA")
rm (object.sce)
Idents(object) <- "final_annotation"
saveRDS(object = object,file = rds.path)
}else{
object <- readRDS(rds.path)
}Cell types were annotated by the authors on each dataset
individually, using a common dictionary of cell types (see https://github.com/theislab/scib-reproducibility). These
are stored in the final_annotation metadata column. Study
of origin is stored in batch metadaa
##
## 10X Freytag Oetjen_A Oetjen_P Oetjen_U
## CD4+ T cells 2937 1238 577 295 1652
## CD8+ T cells 350 270 93 639 253
## CD10+ B cells 0 0 91 41 75
## CD14+ Monocytes 3388 452 350 263 384
## CD16+ Monocytes 364 25 61 26 78
## CD20+ B cells 1546 427 43 31 417
## Erythrocytes 0 0 186 1219 97
## Erythroid progenitors 0 0 112 249 102
## HSPCs 28 0 227 119 99
## Megakaryocyte progenitors 21 16 72 71 76
## Monocyte progenitors 0 0 183 109 136
## Monocyte-derived dendritic cells 182 0 89 60 65
## NK cells 756 476 26 0 63
## NKT cells 1056 432 389 62 157
## Plasma cells 18 0 33 40 38
## Plasmacytoid dendritic cells 81 11 54 41 38
How does the collection of datasets look without any integration?
Run a standard Seurat pipeline for dimensionality reduction
nfeatures <- 1000
ndim <- 20
object <- FindVariableFeatures(object, nfeatures = nfeatures) %>%
NormalizeData() %>% ScaleData() %>%
RunPCA(npcs=ndim) %>% RunUMAP(dims=1:ndim)p1_pre <- DimPlot(object, group.by = "batch") + theme(aspect.ratio = 1) + ggtitle("Dataset/batch before integration")
p2_pre <- DimPlot(object, group.by = "final_annotation", label=T, label.size = 4) + NoLegend() +
theme(aspect.ratio = 1) + ggtitle("Cell labels before integration")
p1_pre | p2_preAlthough cells mostly cluster by the cell type (as annotated in individual datasets), there are also visible batch effects (seen as dataset-specific clustering)
Let’s quantify dataset/batch mixing using the LISI metric from the Raychaudhuri Lab, and conservation of biological information by means of Average Silhouette Width (ASW) of cell labels in the corrected PCA space:
integrationMetrics <- list()
if (!require("scIntegrationMetrics", quietly = TRUE))
install_github("carmonalab/scIntegrationMetrics") #calculates LISI and Silhouette
library(scIntegrationMetrics)
lisi_perplexity <- 20
integrationMetrics[["uncorrected"]] <- list()
integrationMetrics[["uncorrected"]][["batch_LISI"]] <- mean(compute_lisi(object@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="batch", perplexity = lisi_perplexity)[[1]]) # perplexity=effective number of each cell's neighbors; use lower than default to make it faster
integrationMetrics[["uncorrected"]][["celltype_ASW"]] <- mean(compute_silhouette(object@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="final_annotation")[[1]])
integrationMetrics[["uncorrected"]]## $batch_LISI
## [1] 2.914892
##
## $celltype_ASW
## [1] 0.036829
We will now apply STACAS for correcting batch effects.
Standard integration
STACAS takes a list of Seurat objects as input, so let’s first split the merged object into a list of individual batches/datasets
STACAS Step 1: Find integration anchors
stacas_anchors <- FindAnchors.STACAS(obj.list,
anchor.features = nfeatures,
dims = 1:ndim)
## 1/10 2/10 3/10 4/10 5/10 6/10 7/10 8/10 9/10 10/10
## Finding integration anchors...STACAS step 2: Dataset integration
Calculate low-dimensional embeddings and visualize integration results in UMAP
object_integrated <- object_integrated %>% ScaleData() %>% RunPCA(npcs=ndim) %>% RunUMAP(dims=1:ndim)p1_int <- DimPlot(object_integrated, group.by = "batch") + theme(aspect.ratio = 1) + ggtitle("Dataset/batch after integration")
p2_int <- DimPlot(object_integrated, group.by = "final_annotation", label=T, label.size = 2) + NoLegend() +
theme(aspect.ratio = 1) + ggtitle("Cell labels after integration")
p1_int | p2_intQuantify integration metrics:
integrationMetrics[["STACAS"]] <- list()
integrationMetrics[["STACAS"]][["batch_LISI"]] <- mean(compute_lisi(object_integrated@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="batch", perplexity = lisi_perplexity)[[1]])
integrationMetrics[["STACAS"]][["celltype_ASW"]] <- mean(compute_silhouette(object_integrated@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="final_annotation")[[1]])
integrationMetrics[["STACAS"]]## $batch_LISI
## [1] 3.351884
##
## $celltype_ASW
## [1] 0.2295277
Compared to the non-corrected data, we now observe: i) increase in dataset/batch LISI (i.e. effective number of different batches in a cell neighborhood), indicating a higher mixing ii) increase in Silhouette/ASW, indicating that cells with the same annotation (ie of the same kind) were brought into proximity
One-liner STACAS
The previous integration could have been run from the initial object with a single command:
STACAS integration guide trees
Like Seurat, STACAS uses a guide tree to determine integration order. Optionally, you can check the integration guide tree automatically generated by STACAS (e.g. are samples from the same sequencing technology or from similar tissues clustering together?)
You can tune or manually re-calculate this tree and pass it to
IntegrateData.STACAS:
STACAS step 2 with pre-calculated integration tree
Semi-supervised integration
When available, cell type annotations can be used to guide the alignment. STACAS will use this information to penalize anchors where cell types are inconsistent.
In this dataset, cells were annotated by the authors of the
benchmark. For your own data, you may want to do manual annotation or
apply one of several cell annotation tools, such as SingleR,
Garnett or scGate. We will be
posting some examples of cell type annotation using scGate
in a different demo.
** One-liner semi-supervised STACAS ** by indicating in
cell.labels the metadata column that contains cell
annotations
object_integrated_ss <- obj.list %>% Run.STACAS(dims = 1:ndim, anchor.features = nfeatures, cell.labels = "final_annotation" )Note that there is no need for ALL cells to be annotated: we
recommend to set labels to NA or unknown for cells
that cannot be confidently annotated, and they won’t be penalized for
label inconsistency. In addition, you can decide how much weight to give
to cell labels with the label.confidence parameter (from 0
to 1).
Visualize on UMAP space
p1_ss <- DimPlot(object_integrated_ss, group.by = "batch") + theme(aspect.ratio = 1) + ggtitle("Dataset/batch after semi-supervised integration")
p2_ss <- DimPlot(object_integrated_ss, group.by = "final_annotation", label=T, label.size = 2) +
NoLegend() + theme(aspect.ratio = 1) + ggtitle("Cell labels after semi-supervised integration")
p1_ss | p2_ssQuantify integration metrics:
integrationMetrics[["semisupSTACAS"]] <- list()
integrationMetrics[["semisupSTACAS"]][["batch_LISI"]] <- mean(compute_lisi(object_integrated_ss@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="batch", perplexity = lisi_perplexity)[[1]])
integrationMetrics[["semisupSTACAS"]][["celltype_ASW"]] <- mean(compute_silhouette(object_integrated_ss@reductions[["pca"]]@cell.embeddings, meta_data = object@meta.data, label_colnames="final_annotation")[[1]])
integrationMetrics[["semisupSTACAS"]]## $batch_LISI
## [1] 3.199746
##
## $celltype_ASW
## [1] 0.2474989
In this case, by using the semi-supervised approach we observe a gain in biological signal conservation, as measured by cell labels silhouette (celltype_ASW), with a similar dataset/batch mixing (batch_LISI) compared to the unsupervised integration.
Summary of Integration Metrics
if (!require("tidyr", quietly = TRUE))
install.packages("tidyr")
integrationMetricsSummary <- data.frame(unlist(integrationMetrics)) %>% tibble::rownames_to_column() %>% rename(value=unlist.integrationMetrics.) %>% separate(rowname, c("Method","Metric"), sep="\\.")
a <- integrationMetricsSummary %>% filter(Metric=="batch_LISI") %>%
ggplot(aes(x=Method, y=value, fill=Method)) + geom_bar(stat="identity") +
ggtitle("batch_LISI") + theme_bw() +
theme(legend.position="none", axis.text.x=element_blank())
b <- integrationMetricsSummary %>% filter(Metric=="celltype_ASW") %>%
ggplot(aes(x=Method, y=value, fill=Method)) + geom_bar(stat="identity") +
ggtitle("celltype_ASW") + theme_bw() +
theme(axis.text.x=element_blank())
a | bLabel transfer between datasets
In some integration tasks, cell type annotations may be available
only for a fraction of the datasets. In such cases, labels can be
“transferred” to unannotated datasets by similarity to annotated cells.
The annotate.by.neighbors function allows propagating cell
labels to unannotated cells by K-nearest neighbor similarity to
annotated cells.
For example, if one of the datasets were lacking cell type labels (set to NA):
Transfer labels from annotated to non-annotated cells
complete <- annotate.by.neighbors(incomplete, labels.col = "final_annotation")
a <- DimPlot(incomplete, group.by = "final_annotation", label=T, label.size = 4) + NoLegend() +
theme(aspect.ratio = 1) + ggtitle("Incomplete annotation")
b <- DimPlot(complete, group.by = "final_annotation", label=T, label.size = 4) + NoLegend() +
theme(aspect.ratio = 1) + ggtitle("Complete annotation")
a | bNotes on data integration
The first critical step in batch effect correction is to ensure that the gene names in your datasets are harmonized. If different studies use different naming conventions, synonyms, etc., methods for dimensionality reduction will treat these genes as entirely different variables, and introducing artificial differences between datasets. It is recommended that your pre-processing pipeline takes care of resolving ambiguities in gene naming across datasets. You can use
STACAS:::standardizeGeneSymbolsbut there are other tools such as HGNChelperThe calculation of highly variable genes (HVG) is a fundamental step for dimensionality reduction and integration. We recommend excluding certain classes of genes, e.g. mitochondrial, ribosomal, heat-shock genes, from the list of HVG, as their expression are typically more strongly associated to technical variation than to actual biological differences. The
FindVariableGenes.STACAS()function allows providing a list of genes to exclude from HVG; see an example below.
We can use the collection of signatures stored in SignatuR package
if (!require("SignatuR", quietly = TRUE))
install_github("carmonalab/SignatuR")
library(SignatuR)
print(SignatuR$Hs)## levelName
## 1 Hs
## 2 ¦--Blocklists
## 3 ¦ ¦--Pseudogenes
## 4 ¦ ¦--HSP
## 5 ¦ °--Non-coding
## 6 ¦--Cell_types
## 7 ¦--Programs
## 8 ¦ ¦--cellCycle.G1S
## 9 ¦ ¦--cellCycle.G2M
## 10 ¦ ¦--IFN
## 11 ¦ ¦--Tcell.cytotoxicity
## 12 ¦ ¦--Tcell.exhaustion
## 13 ¦ °--Tcell.stemness
## 14 °--Compartments
## 15 ¦--Mito
## 16 ¦--Ribo
## 17 ¦--TCR
## 18 °--Immunoglobulins
Then we tell STACAS to exclude specific genes from HVG used in
integration, using the parameter genesBlockList
my.genes.blocklist <- c(GetSignature(SignatuR$Hs$Blocklists),
GetSignature(SignatuR$Hs$Compartments))
object_integrated_blockList <- Run.STACAS(obj.list, genesBlockList = my.genes.blocklist,
dims = 1:ndim, anchor.features = nfeatures)## R version 4.3.0 (2023-04-21)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Zurich
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] SignatuR_0.1.1 tidyr_1.3.0 scIntegrationMetrics_0.5
## [4] STACAS_2.1.1 ggplot2_3.4.2 dplyr_1.1.2
## [7] SeuratObject_4.1.3 Seurat_4.3.0.1 remotes_2.4.2
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_1.8.5 magrittr_2.0.3
## [4] spatstat.utils_3.0-3 farver_2.1.1 rmarkdown_2.22
## [7] vctrs_0.6.3 ROCR_1.0-11 spatstat.explore_3.2-1
## [10] htmltools_0.5.5 BiocNeighbors_1.18.0 sass_0.4.6
## [13] sctransform_0.3.5 parallelly_1.36.0 KernSmooth_2.23-21
## [16] bslib_0.5.0 htmlwidgets_1.6.2 ica_1.0-3
## [19] plyr_1.8.8 plotly_4.10.2 zoo_1.8-12
## [22] cachem_1.0.8 igraph_1.5.0 mime_0.12
## [25] lifecycle_1.0.3 pkgconfig_2.0.3 Matrix_1.5-4.1
## [28] R6_2.5.1 fastmap_1.1.1 fitdistrplus_1.1-11
## [31] future_1.32.0 shiny_1.7.4 digest_0.6.32
## [34] colorspace_2.1-0 S4Vectors_0.38.1 patchwork_1.1.2
## [37] tensor_1.5 irlba_2.3.5.1 vegan_2.6-4
## [40] labeling_0.4.2 progressr_0.13.0 fansi_1.0.4
## [43] spatstat.sparse_3.0-2 mgcv_1.8-42 httr_1.4.6
## [46] polyclip_1.10-4 abind_1.4-5 compiler_4.3.0
## [49] withr_2.5.0 BiocParallel_1.34.2 highr_0.10
## [52] R.utils_2.12.2 MASS_7.3-60 permute_0.9-7
## [55] tools_4.3.0 lmtest_0.9-40 httpuv_1.6.11
## [58] future.apply_1.11.0 goftest_1.2-3 R.oo_1.25.0
## [61] glue_1.6.2 nlme_3.1-162 promises_1.2.0.1
## [64] grid_4.3.0 Rtsne_0.16 cluster_2.1.4
## [67] reshape2_1.4.4 generics_0.1.3 gtable_0.3.3
## [70] spatstat.data_3.0-1 R.methodsS3_1.8.2 rmdformats_1.0.4
## [73] data.table_1.14.8 sp_2.0-0 utf8_1.2.3
## [76] BiocGenerics_0.46.0 spatstat.geom_3.2-1 RcppAnnoy_0.0.20
## [79] ggrepel_0.9.3 RANN_2.6.1 pillar_1.9.0
## [82] stringr_1.5.0 later_1.3.1 splines_4.3.0
## [85] lattice_0.21-8 renv_0.17.3 survival_3.5-5
## [88] deldir_1.0-9 tidyselect_1.2.0 miniUI_0.1.1.1
## [91] pbapply_1.7-2 knitr_1.43 gridExtra_2.3
## [94] bookdown_0.34 scattermore_1.2 stats4_4.3.0
## [97] xfun_0.39 matrixStats_1.0.0 stringi_1.7.12
## [100] lazyeval_0.2.2 yaml_2.3.7 evaluate_0.21
## [103] codetools_0.2-19 data.tree_1.0.0 tibble_3.2.1
## [106] BiocManager_1.30.21 cli_3.6.1 uwot_0.1.15
## [109] xtable_1.8-4 reticulate_1.30 munsell_0.5.0
## [112] jquerylib_0.1.4 Rcpp_1.0.10 globals_0.16.2
## [115] spatstat.random_3.1-5 png_0.1-8 parallel_4.3.0
## [118] ellipsis_0.3.2 listenv_0.9.0 viridisLite_0.4.2
## [121] scales_1.2.1 ggridges_0.5.4 leiden_0.4.3
## [124] purrr_1.0.1 rlang_1.1.1 cowplot_1.1.1
Further reading
The STACAS package and installation instructions are available at: STACAS package
The code for this demo can be found on GitHub
References
Andreatta A., Carmona S. J. (2021). STACAS: Sub-Type Anchor Correction for Alignment in Seurat to integrate single-cell RNA-seq data. - Bioinformatics
Luecken, M. D., Büttner, M., Chaichoompu, K., Danese, A., Interlandi, M., Müller, M. F., … & Theis, F. J. (2022). Benchmarking atlas-level data integration in single-cell genomics. - Nature methods
Hao, Y., Hao, S., Andersen-Nissen, E., Mauck III, W. M., Zheng, S., Butler, A., … & Satija, R. (2021). Integrated analysis of multimodal single-cell data. - Cell